1 Introduction

A priority scheduler controls access to a resource in accordance with some
priority assignment. Each task is assigned a priority; whenever there is
contention for a resource, access is granted to the task with the highest
priority among those competing. Priority schedulers are frequently employed
in real-time systems to allocate the processor among tasks that must produce
results in a timely manner [6, 12].

A  priority  inversion  occurs  when  a  lower-priority  task  delays  execution 
of  a  higher-priority  task  [5,  9].  For  example,  a  task  holding  a  write  lock  for 
some  data  will  delay  a  task  attempting  to  acquire  a  read  lock  for  that  data.  If 
the  holder  of  the  write  lock  has  a  lower  priority  than  the  task  attempting  to 
acquire  the  read  lock,  then  a  lower-priority  task  is  delaying  a  higher-priority 
task  and  a  priority  inversion  has  occurred. 

The possibility of priority inversions creates difficulties for the designer
of a real-time system. Not only must a task compete for resources with
higher-priority tasks, but during a priority inversion, it loses resources to
lower-priority ones. When a high-priority task $\tau_H$ is delayed by a
lower-priority task $\tau_L$, then $\tau_H$ effectively competes with all tasks
assigned priorities at least that of $\tau_L$, rather than with only those tasks
assigned priorities at least that of $\tau_H$. Since $\tau_L$ can be an arbitrary
task, establishing that $\tau_H$ will meet a response-time goal requires
reasoning involving all tasks in the system rather than just the subset with
priority at least that of $\tau_H$.

One approach to coping with priority inversion is to modify task priorities
dynamically so that priority inversions are bounded and short in length [5,
10]. In these priority inheritance protocols, a task's priority is elevated to a
level that is the maximum of its original priority and the priority of any task
that is being delayed by it. Thus, priority inversions are permitted, but only
in a carefully controlled way.

In this paper, we explore approaches to preventing priority inversion that
do not involve modifying task priorities. In Sections 2 and 3, we formalize
priority inversion and give sufficient conditions for its prevention. Based on
these, some new protocols to prevent priority inversions from occurring are
derived in Sections 4 and 5. The protocol of Section 4 is appropriate for
systems where the times that tasks hold resources can be bounded; the protocol
of Section 5 is appropriate for database systems, where tasks (transactions)
can be aborted. In Section 6, we consider conditions for avoiding priority
inversions in systems where there are multiple independent schedulers, each
making allocation decisions for some subset of the resources. Section 7 puts
our work in context and discusses some unsolved problems.


2  System  Model 

Formalizing priority inversion requires that we formalize the notions of
priority assignment and delay. To do this, we model a system as a set of tasks
$T = \{\tau_1, \tau_2, \ldots, \tau_n\}$ where a task is any computation that can
be scheduled. Thus, our use of the term task is synonymous with alternatives
such as process, job, and transaction.

2.1  Task  Priorities 

A priority assignment is an irreflexive partial order¹ $\succ$ on $T$ such that
$\tau \succ \tau'$ whenever task $\tau$ has higher priority than $\tau'$. Observe
that this definition allows tasks to have incomparable priorities. Therefore,
it is possible that neither $\tau \succ \tau'$ nor $\tau' \succ \tau$ holds for some
pair of tasks $\tau$ and $\tau'$. By assigning incomparable priorities to tasks,
the number of constraints imposed by a priority assignment is reduced,
avoiding the possibility of extraneous priority inversions.

Define the peer group of a task $\tau$ as the set of tasks $\tau'$ such that either
$\tau' \succ \tau$ or $\tau'$ is incomparable to $\tau$. In the absence of priority
inversions, we need only consider $\tau$ and tasks that are in its peer group in
analyzing whether $\tau$ will satisfy given response-time constraints. This is
because only tasks in the peer group of $\tau$ can cause $\tau$ to be delayed.
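
The peer-group definition can be illustrated with a small Python sketch. The
function names, the representation of the priority assignment as a set of
(higher, lower) pairs, and the choice to leave a task out of its own peer
group are assumptions made for this illustration only.

```python
from itertools import product


def transitive_closure(pairs):
    """Return the transitive closure of a set of (higher, lower) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def peer_group(task, tasks, higher_than):
    """Tasks with higher priority than `task` or incomparable to it.

    Assumption: a task is not counted as a member of its own peer group.
    """
    order = transitive_closure(higher_than)
    return {
        other for other in tasks
        if other != task and (task, other) not in order
    }


tasks = {"a", "b", "c", "d"}
higher_than = {("a", "b"), ("b", "c")}  # a > b > c, d is incomparable to all
```

Here `peer_group("b", tasks, higher_than)` contains `a` (higher priority) and
`d` (incomparable) but not `c`, which has lower priority.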

2.2  Resources 

Tasks  can  cause  each  other  to  be  delayed  in  a  variety  of  ways.  Some  of  these 
are  explicit,  such  as  when  one  task  awaits  a  message  sent  by  another  or  when 
a  lock  held  by  one  task  prevents  another  from  acquiring  that  lock.  Other 
causes of delay are implicit. For example, the presence of finite-capacity,
time-multiplexed resources, such as memory, processors, and I/O devices,
can lead to the (implicit) delay of a task requiring use of a resource by the
task using that resource.

¹ An irreflexive partial order on a set is an asymmetric, irreflexive,
transitive relation on pairs of elements from that set.

For our purposes we can abstract from these particulars, postulating that
a system comprises a set of resources and a scheduler.² A task obtains access
to a single unit of a resource r by invoking the

    request(r)

operation and relinquishes that access by invoking the

    release(r)

operation of the scheduler. Execution of a request operation delays the
invoker until that request can be granted. Whether a request is granted is
determined  by  the  scheduler.  A  task  may  have  only  one  outstanding  request 
at  a  time.  Thus,  if  multiple  units  of  a  resource  or  several  different  resources 
are  needed  for  execution,  a  task  must  request  and  acquire  them  sequentially. 
We  discuss  the  possibility  of  a  task  requesting  multiple  resources  through  a 
single  operation  in  Section  7.3. 

Request and release operations are not always explicitly invoked by tasks.
Sometimes, these operations are invoked implicitly as part of some other
system operation. For example, an operation to receive a message will implicitly
invoke a request operation that is granted when the message becomes
available for receipt. In other cases, request or release might not be invoked
by tasks at all, instead being invoked due to other activity in the system.
Consider  a  multiprogrammed  processor  that  uses  an  interval  timer  to  force 
task  switches.  An  execution  of  the  interval-timer  interrupt  handler  can  be 
regarded  as  (i)  performing  a  release  and  a  subsequent  request  for  the  task 
that  was  executing  when  the  timer-interrupt  occurred  and  then  (ii)  granting 
the  request  for  the  task  that  is  next  selected  for  execution  on  the  processor. 

Our request/release model turns out to be quite general. It can even
be used to describe situations in which tasks are delayed because of some
application-dependent aspect of the system state. For example, it is not
unusual for a concurrent program to contain some form of conditional wait
statement that delays a task until the program variables satisfy some Boolean
expression. (Most synchronization primitives are instances of conditional
wait statements.) We can model such a conditional wait by regarding it as a
request on a virtual resource. The request is granted if the Boolean
expression is true; otherwise, the request is delayed. Execution of any assignment
statement in another task that makes the Boolean expression true is treated
as a release on the resource.

² Most real systems have multiple schedulers, but postulating a single
scheduler is not a limitation. It is always possible to model the effect of a
collection of schedulers by using a single scheduler that makes allocation
decisions using only information that would be available to the relevant
scheduler.

2.3  Delay 

We model task delays by a binary relation. A task $\tau_i$ waits for task
$\tau_j$ at time $t$, denoted $\tau_i \to_t \tau_j$, if and only if $\tau_i$ is
delayed at time $t$ in its request for some resource and $\tau_j$ can release that
resource. Note that a task could be waiting for (any of) a set of tasks $T$,
each of which can release the resource being requested. The definition of
$\to_t$ allows us to decompose this situation into a conjunction of waits-for
relations:

    $\tau_i \to_t T \equiv \bigwedge_{\tau \in T} (\tau_i \to_t \tau)$    (1)

We  assume  that  a  grant/delay  decision  can  be  made  at  the  time  of  a 
request  and  that  a  delay  always  can  be  attributed  to  some  set  T  of  tasks. 
For  example,  if  there  are  multiple  units  of  the  resource  available,  the  delay  is 
attributed  to  the  set  of  all  tasks  that  have  been  granted  but  not  yet  released 
the  resource.  And,  in  the  case  of  a  request  for  a  virtual  resource  associated 
with  a  conditional  wait,  the  set  T  contains  all  those  tasks  that  can  execute 
assignment  statements  that  make  true  the  Boolean  condition  being  awaited. 


3  Characterizing  Priority  Inversion 

In order to characterize priority inversion, we must reason about the
transitive closure of the waits-for relation. We say that a task $\tau_i$ implicitly
waits for task $\tau_j$ at time $t$, denoted $\tau_i \to_t^{+} \tau_j$, if and only if
either $\tau_i \to_t \tau_j$ or there exists $\tau_k$ such that
$(\tau_i \to_t^{+} \tau_k) \wedge (\tau_k \to_t \tau_j)$.

Priority inversion has occurred at time $t$ when progress of a task is blocked
by the actions of a lower priority task. Thus, a system contains a priority
inversion at time $t$ if and only if there exist tasks $\tau_i$ and $\tau_j$ such that
$(\tau_i \to_t^{+} \tau_j) \wedge (\tau_i \succ \tau_j)$.

Figure 1: Four-Task Composite System Graph.

It will be convenient to represent the priority assignment and waits-for
relations in effect at a given time $t$ as graphs. The directed graph
corresponding to a priority assignment $\succ$ on a set of tasks $T$ is
$P = (T, E_P)$ where $E_P = \{(u, v) \mid v \succ u\}$. Thus, the nodes of $P$
represent tasks and an edge is drawn from a task to all higher priority tasks.
Similarly, the waits-for relation at time $t$ can be represented as the directed
graph $W = (T, E_W)$ where $E_W = \{(u, v) \mid u \to_t v\}$. Here, edges are
drawn from a task to all other tasks that it waits for.

Given these graphs, the system state at a time $t$ can be represented by a
composite system graph, $G = (T, E_P \cup E_W)$. Figure 1 depicts such a graph for
a system of four tasks. Single-arrow edges represent waits-for relations and
double-arrow edges represent priority relations. Thus, the graph depicts the
situation where there are four tasks such that $\tau_2 \succ \tau_1 \succ \tau_3$ and
$(\tau_1 \to_t \tau_2)$, $(\tau_2 \to_t \tau_4)$, $(\tau_4 \to_t \tau_3)$. Note that
there can be multiple edges between a pair of nodes.

The following theorem uses a composite system graph to characterize the
existence of a priority inversion at time $t$. It is based on directed cycles
containing exactly one priority edge. We call such cycles $\pi$-cycles.

Theorem 1 A composite system graph $G$ describes a priority inversion if
and only if $G$ contains a directed cycle involving exactly one priority edge
($\pi$-cycle).

Proof: Without loss of generality, let
$\tau_j \Rightarrow \tau_i \to \cdots \to \tau_j$ be a $\pi$-cycle of $G$. By
definition of $\to_t^{+}$, we have $\tau_i \to_t^{+} \tau_j$. From priority edge
$\tau_j \Rightarrow \tau_i$, we have the relation $\tau_i \succ \tau_j$. Thus, the
system contains a priority inversion according to the definition given above.
The result that a priority inversion implies a $\pi$-cycle in $G$ follows
trivially from the construction of a composite system graph. □

The system depicted in Figure 1 contains two priority inversions,
corresponding to the two $\pi$-cycles
($\tau_3 \Rightarrow \tau_1 \to \tau_2 \to \tau_4 \to \tau_3$ and
$\tau_3 \Rightarrow \tau_2 \to \tau_4 \to \tau_3$). Task $\tau_3$ is responsible for
delaying tasks $\tau_1$, $\tau_2$, and $\tau_4$ and has lower priority than $\tau_1$
and $\tau_2$.
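
The characterization can be checked mechanically on the Figure 1 example. The
following Python sketch is only an illustration: the function names and the
encoding of tasks as integers are assumptions, and it uses only the two
priority relations stated explicitly above (task 3 below tasks 1 and 2). It
expects the set of priority pairs to be transitively closed already.

```python
def reachable(start, edges):
    """Nodes reachable from `start` by following one or more directed edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def priority_inversions(waits_for, higher_than):
    """Pairs (hi, lo) where hi implicitly waits for lo and hi outranks lo.

    `higher_than` holds (higher, lower) pairs and is assumed to be
    transitively closed.
    """
    return sorted(
        (hi, lo) for hi, lo in higher_than
        if lo in reachable(hi, waits_for)
    )


# Waits-for edges of Figure 1: 1 -> 2, 2 -> 4, 4 -> 3.
waits_for = {(1, 2), (2, 4), (4, 3)}
# Task 3 has lower priority than tasks 1 and 2.
higher_than = {(1, 3), (2, 3)}
inversions = priority_inversions(waits_for, higher_than)
```

Each reported pair corresponds to one cycle with exactly one priority edge:
the priority edge from the lower-priority task closes the waits-for path that
leads from the higher-priority task down to it.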

From Theorem 1, we conclude that any condition that prevents a composite
system graph $G$ from having a $\pi$-cycle is a sufficient condition for
avoiding priority inversions. Examples of such conditions are the following:

1. No Priority Assignment. If all tasks were incomparable, then the
priority graph would be empty and a $\pi$-cycle could never form.

2. No Delay. Without delay, the waits-for graph would be empty,
guaranteeing the absence of $\pi$-cycles.

3. Preemption. A waits-for relation persists until a task relinquishes a
resource. Preemption causes a task to relinquish a resource, so
preemption can change the waits-for relation in a way that prevents a
$\pi$-cycle from forming.

4. No $\pi$-Cycle. The waits-for and the priority relations must form in a
particular fashion for priority inversion to exist. Theorem 1 establishes
the relevant property as a $\pi$-cycle.

Note that preemption both removes and adds elements to the waits-for
relation. By preempting a resource $r$ that had been granted to a task $\tau_i$ and
granting $r$ to $\tau_j$, all waits-for relations from $\tau_j$ to other tasks are
removed and waits-for relations from $\tau_i$ to all tasks that have access to the
resource are added.


Any strategy for preventing priority inversions ultimately must be based
on avoiding the $\pi$-cycle condition of Theorem 1. In the strategies that follow,
we do just this by ensuring that at least one of the sufficient conditions above
holds. First, in Section 4, we show how by eliminating the possibility of
certain waits-for relations, a $\pi$-cycle is avoided. Thus, this strategy is based
on condition 4, No $\pi$-Cycle. Then, in Section 5, we show an application
of preemption by giving a timestamp-based concurrency controller. This
strategy is based on condition 3, Preemption.

4 Applying the Theory: Developing Reservation Protocols

The $\pi$-cycle condition of Theorem 1 can be prevented by having each task
reserve in advance the interval during which it will hold a resource and
using that information to prevent certain waits-for relations from forming. If
reservations from a low-priority task never overlap with reservations from
a higher-priority task, then priority inversions become impossible. In this
section, we develop protocols that exploit this insight for avoiding priority
inversions. The method by which we obtain these protocols is more important
than the protocols themselves. Deriving the protocols provides an
opportunity for us to show how our theory can be used to obtain policies for
avoiding priority inversions from conditions that ensure $\pi$-cycles are
impossible.

Let $H_i = \{\tau \mid \tau \succ \tau_i\}$ be the set of tasks that have higher
priority than $\tau_i$, and let
$I_i = \{\tau \mid \neg(\tau \succ \tau_i) \wedge \neg(\tau_i \succ \tau)\}$ be the set
of tasks incomparable to $\tau_i$. Thus, the peer group for a task $\tau_i$ is
$PG_i = H_i \cup I_i$. For each resource $r$, assume that each task $\tau_i$ is able
to compute $hold_i^r(t)$, the upper bound on the amount of time that $\tau_i$ will
hold $r$ the next time (with respect to $t$) it is granted $r$, and $next_i^r(t)$,
the lower bound on the next time (with respect to $t$) $\tau_i$ will request $r$.
During the interval between when task $\tau_i$ requests resource $r$ and when it
releases $r$, define $next_i^r(t)$ to equal $t$. And, if $\tau_i$ holds $r$ at time
$t$, then define $hold_i^r(t)$ to be an upper bound on the remaining amount of
time $\tau_i$ will hold $r$. The reservation protocols we derive require that at
times of allocation decisions, the scheduler be able to interrogate a set of
tasks for their $next_i^r(t)$ values for a set of resources. We are assuming that
the communication delays between tasks and the scheduler are negligible with
respect to $hold_i^r(t)$ and $next_i^r(t)$. In case they are not, the allocation
policies can be easily modified to account for them.³

An allocation policy can be derived from any program invariant⁴ that
precludes formation of $\pi$-cycles. Not only must the program invariant imply
that there is no $\pi$-cycle present in the current state, but also that the
current state is not one from which formation of a $\pi$-cycle is inevitable.
Therefore, it suffices that the program invariant imply the stronger condition
that there is no $\pi$-cycle present in the current state and that the current
state is not one from which formation of a $\pi$-cycle is possible. Although such
a stronger program invariant could rule out safe states, it is usually easier
to construct and maintain.

In order to construct a program invariant that rules out the possibility of
future $\pi$-cycles, define the predicate $\tau_j \rightsquigarrow_t^{t'} \tau_i$ to
mean: given the resources $\tau_i$ has allocated at time $t$, it is possible
$\tau_j \to_{t'} \tau_i$ will hold for some $t'$ such that $t' \ge t$. Thus, letting
$R_i(t)$ be the set of resources that $\tau_i$ has allocated at time $t$,

    $\tau_j \rightsquigarrow_t^{t'} \tau_i \equiv (t' \ge t) \wedge (\exists r: r \in R_i(t): (hold_i^r(t) + t > t') \wedge (next_j^r(t) \le t'))$

Note that if $\tau_j \to_t \tau_i$, then at time $t$ there is some resource
$r \in R_i(t)$ and $r$ has been requested by $\tau_j$. Therefore, if
$\tau_j \to_t \tau_i$, then, by definition, $hold_i^r(t) > 0$ and $next_j^r(t) = t$,
and so:

    $\tau_j \to_t \tau_i \supset \tau_j \rightsquigarrow_t^{t} \tau_i$    (2)
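
As a rough illustration of the predicate just defined, the following Python
sketch tests whether a future wait is possible. It assumes the reading in
which a wait at time t' is possible when t' >= t, the holder's bound
hold + t exceeds t', and the other task's earliest request time is at most
t'; the function names and dictionary layout are hypothetical.

```python
INF = float("inf")


def may_wait(t, t_prime, holder_hold, requester_next):
    """Possible wait at time t_prime, given allocations known at time t.

    holder_hold:    resource -> upper bound on remaining hold time (holder)
    requester_next: resource -> lower bound on next request time (requester)
    """
    if t_prime < t:
        return False
    return any(
        hold + t > t_prime and requester_next.get(r, INF) <= t_prime
        for r, hold in holder_hold.items()
    )


def may_ever_wait(t, holder_hold, requester_next):
    """True iff may_wait holds for some t_prime >= t.

    For one resource the admissible t_prime values form the half-open window
    [max(t, next), t + hold), which is non-empty iff max(t, next) < t + hold.
    """
    return any(
        max(t, requester_next.get(r, INF)) < t + hold
        for r, hold in holder_hold.items()
    )
```

With `t = 10`, a hold bound of 5 on resource `"r"`, and a next-request bound
of 12, a wait is possible at times 12 through just before 15; if the
next-request bound is 15 or later, no wait on that resource is possible.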


4.1  Policy  1 

In a system where resources are shared infrequently, the most common
$\pi$-cycle would involve just two tasks $\tau_i$ and $\tau_j$ (say) such that
$\tau_j \to_t \tau_i \wedge \tau_j \succ \tau_i$. An obvious program invariant to
choose is the negation of this predicate:⁵

    $(\forall \tau_i, \tau_j: \tau_j \to_t \tau_i \supset \neg(\tau_j \succ \tau_i))$    (3)

³ In any implementation, $hold_i^r(t)$ and $next_i^r(t)$ would need only be
computed on demand at times of resource allocation. They can be based on
empirical data collected during previous executions or on an a priori analysis
of the code that is being executed.

⁴ A program invariant is an assertion about the program state that is not
invalidated by program execution.

⁵ We write $A \supset B$ to denote the logical implication, A implies B.

However, (3) does not imply that there can be no $\pi$-cycle in the composite
system graph. The following predicate does:

    $(\forall \tau_i, \tau_j: \tau_j \to_t^{+} \tau_i \supset \neg(\tau_j \succ \tau_i))$    (4)

So, we strengthen (3). First, note that since $\succ$ is asymmetric, we have:

    $(\forall \tau_i, \tau_j: \tau_i \succ \tau_j \supset \neg(\tau_j \succ \tau_i))$    (5)

Thus,

    $(\forall \tau_i, \tau_j: \tau_j \to_t \tau_i \supset \tau_i \succ \tau_j)$    (6)

implies (3), and so if (6) also implied (4) then (6) would imply there can be
no $\pi$-cycle in the composite system graph. We now show that (6) implies
(4).

Assume (6) holds. Since $\succ$ is a transitive relation, we have from (6) and
the definition of $\to_t^{+}$ that:

    $(\forall \tau_i, \tau_j: \tau_j \to_t^{+} \tau_i \supset \tau_i \succ \tau_j)$

The last equation together with (5) imply (4). Thus, (6) implies that there
can be no $\pi$-cycle in the composite system graph and can be used in
constructing the program invariant.

Replacing $\tau_j \to_t \tau_i$ according to (2) we get:

    $(\forall \tau_i, \tau_j: \tau_j \rightsquigarrow_t^{t} \tau_i \supset \tau_i \succ \tau_j)$    (7)

Unfortunately, (7) does not imply that the current state is one from which
later formation of $\pi$-cycles is impossible. For example, suppose a task
$\tau_i$ can allocate a resource $r$ that a task $\tau_j$ will later request and
$\tau_j \succ \tau_i$ holds. If $\tau_i$ allocates $r$, then (7) is not invalidated.
However, if $\tau_j$ subsequently attempts to allocate $r$, then (7) is
invalidated because the request cannot be granted (without preempting $r$ from
$\tau_i$). So, (7) was not strong enough to prevent subsequent formation of a
$\pi$-cycle.

We can strengthen (7) by weakening its antecedent:

    $(\forall \tau_i, \tau_j: (\exists t': t \le t': \tau_j \rightsquigarrow_t^{t'} \tau_i) \supset \tau_i \succ \tau_j)$

By construction, this does prevent the possibility of $\pi$-cycles forming in
subsequent states. Taking the contrapositive, substituting for the definition
of $\rightsquigarrow$, and performing some algebraic manipulations in order to
eliminate $t'$, results in the following strengthening:

    $I_1$: $(\forall \tau_i, \tau_j: \neg(\tau_i \succ \tau_j) \supset (\forall r: r \in R_i(t): hold_i^r(t) + t \le next_j^r(t)))$

Thus, we can use $I_1$ as a program invariant for ensuring that $\pi$-cycles cannot
form during execution.

To ensure that $I_1$ holds throughout execution, we must show that it is
true initially and that execution does not invalidate it. Assuming that tasks
are initially started with no resources allocated to them, $R_i(0)$ will be empty
for all tasks $\tau_i$, and so $I_1$ is initially true. To ensure that $I_1$ is not
invalidated by execution, assume that it is true and some task $\tau_i$ requests a
resource $r'$. Let $ok(r', I_1)$ denote those system states in which $r'$ could be
added to $R_i(t)$. Then, any policy that ensures a condition $C(r', \tau_i)$ holds
before $r'$ is granted to $\tau_i$, such that

    $C(r', \tau_i) \wedge I_1 \supset ok(r', I_1)$

is valid, will ensure that $I_1$ holds throughout execution (hence the formation
of $\pi$-cycles is precluded).

Using the definition in [4] of $wp$ for assignment statement
$R_i(t) := R_i(t) \cup \{r'\}$, we compute that

    $ok(r', I_1) = wp(R_i(t) := R_i(t) \cup \{r'\}, I_1)$.

Thus, for

    $C(r', \tau_i) \stackrel{def}{=} (\forall \tau_j: \neg(\tau_i \succ \tau_j) \supset hold_i^{r'}(t) + t \le next_j^{r'}(t))$

we have

    $C(r', \tau_i) \wedge I_1 \supset ok(r', I_1)$

because

    $C(r', \tau_i) \wedge I_1 = wp(R_i(t) := R_i(t) \cup \{r'\}, I_1)$.

Therefore, provided $C(r', \tau_i)$ holds before $r'$ is granted to $\tau_i$ we can
conclude that $I_1$ is not invalidated by program execution. This can be done by
ensuring that the scheduler never grants $r'$ to $\tau_i$ unless $C(r', \tau_i)$
holds. Rearranging terms in the definition for $C(r', \tau_i)$ results in the
following (equivalent) policy:

Policy 1 The request of task $\tau_i$ for resource $r'$ at time $t$ is granted if

    $hold_i^{r'}(t) \le \min_{\tau_j \in PG_i} (next_j^{r'}(t) - t)$.

Otherwise, the request is delayed.

In words, a task is granted a resource $r'$ only if it will release $r'$ before any
task in its peer group requests $r'$. Note that the incomparable tasks must be
involved in the allocation decision; it is necessary that $\tau_j$ range over
$I_i$ as well as $H_i$.

Policy 1 is rather conservative. It does not allow waits-for edges to
develop between incomparable tasks in fear of a cycle developing indirectly
between two comparable tasks. However, if the priority assignment is a total
order and $PG_i = H_i$, then a reservation is made with respect to all tasks
with higher priority, just as we would expect.
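
A grant decision in the spirit of Policy 1 can be sketched in Python as
follows. This is a minimal sketch under stated assumptions, not the paper's
implementation: the function name and argument layout are hypothetical, the
comparison is treated as non-strict, and the requester itself is assumed to
be excluded from the peer-group data passed in.

```python
import math


def policy1_grant(t, hold_bound, peer_next_request):
    """Decide whether to grant a request for one resource at time t.

    hold_bound:        upper bound on how long the requester will hold it
    peer_next_request: peer task -> lower bound on the time that peer will
                       next request the same resource (peer group only)
    """
    # With no peers there is nothing to collide with, so the bound is infinite.
    earliest = min(peer_next_request.values(), default=math.inf)
    return hold_bound <= earliest - t


# At t = 10 the requester needs the resource for at most 5 time units.
grant_ok = policy1_grant(10, 5, {"tau_a": 15, "tau_b": 20})   # released by 15
grant_no = policy1_grant(10, 5, {"tau_a": 14, "tau_b": 20})   # tau_a may ask at 14
```

The scheduler would delay the request in the second case, because a peer could
ask for the resource before the requester is guaranteed to have released it.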

4.2  Policy  2 

Another predicate that implies there can be no $\pi$-cycle in the composite
system graph is the following:

    $(\forall \tau_i, \tau_j: \tau_j \succ \tau_i \supset \neg(\exists \tau_k, \tau_l: \tau_l \to_t \tau_i \wedge \tau_j \to_t \tau_k))$    (8)

This is because all $\pi$-cycles contain an instance of Figure 2. Rewriting (8)
and replacing the waits-for relations according to (2) we get:

    $(\forall \tau_i, \tau_j: \tau_j \succ \tau_i \supset \neg(\exists \tau_k, \tau_l: \tau_l \rightsquigarrow_t^{t} \tau_i \wedge \tau_j \rightsquigarrow_t^{t} \tau_k))$    (9)

Figure 2: Bad Composite System Graph.

Although this predicate ensures that no $\pi$-cycle exists in the current
state, it does not imply that the current state is one from which formation of
$\pi$-cycles is impossible. As in the derivation of Policy 1, it is not difficult to
construct a scenario where (9) is preserved up until the point when a state is
reached where a $\pi$-cycle is inevitable. And, as before, the problem is solved
by strengthening. In particular, observe that
$\neg(\exists \tau_l, t_2: t \le t_2: \tau_l \rightsquigarrow_t^{t_2} \tau_i)$
implies $\neg(\exists \tau_l: \tau_l \rightsquigarrow_t^{t} \tau_i)$ and that
$\neg(\exists \tau_k, t_1, t_2: t \le t_1 \le t_2: \tau_j \rightsquigarrow_{t_1}^{t_2} \tau_k)$
implies $\neg(\exists \tau_k: \tau_j \rightsquigarrow_t^{t} \tau_k)$. We can use these
facts to strengthen (9) by strengthening its consequent:

    $(\forall \tau_i, \tau_j: \tau_j \succ \tau_i \supset \neg(\exists \tau_k, \tau_l, t_1, t_2: t \le t_1 \le t_2: \tau_l \rightsquigarrow_t^{t_2} \tau_i \wedge \tau_j \rightsquigarrow_{t_1}^{t_2} \tau_k))$    (10)

By using two separate times $t_1$ and $t_2$, we characterize scenarios where
$\tau_j$ is delayed due to a later allocation by $\tau_k$.

Substituting in (10) according to the definition of $\rightsquigarrow$, we obtain:

    $(\forall \tau_i, \tau_j: \tau_j \succ \tau_i \supset$
      $\neg(\exists \tau_k, \tau_l, t_1, t_2, r_i, r_k: t \le t_1 \le t_2 \wedge r_i \in R_i(t) \wedge r_k \in R_k(t_1):$
        $hold_i^{r_i}(t) + t > t_2 \wedge t_2 \ge next_l^{r_i}(t) \wedge$
        $hold_k^{r_k}(t_1) + t_1 > t_2 \wedge t_2 \ge next_j^{r_k}(t_1)))$

Using algebra to eliminate variable $t_2$ and moving the negation inside the
quantification results in the following strengthening:

    $(\forall \tau_i, \tau_j: \tau_j \succ \tau_i \supset$
      $(\forall \tau_k, \tau_l, t_1, r_i, r_k: t \le t_1 \wedge r_i \in R_i(t) \wedge r_k \in R_k(t_1):$
        $hold_i^{r_i}(t) + t \le next_l^{r_i}(t) \vee hold_k^{r_k}(t_1) + t_1 \le next_j^{r_k}(t_1) \vee$
        $hold_i^{r_i}(t) + t \le next_j^{r_k}(t_1) \vee hold_k^{r_k}(t_1) + t_1 \le next_l^{r_i}(t)))$

Deleting three of the disjuncts in the consequent then results in the following
strengthening:

    $(\forall \tau_i, \tau_j: \tau_j \succ \tau_i \supset$
      $(\forall \tau_k, t_1, r_i, r_k: t \le t_1 \wedge r_i \in R_i(t) \wedge r_k \in R_k(t_1):$
        $hold_i^{r_i}(t) + t \le next_j^{r_k}(t_1)))$

Any subset of the disjuncts could have been deleted, with other choices
leading to different policies.

We can further simplify by removing references to $\tau_k$ and $t_1$. We do
this by strengthening based on the following observations. First, because
$R_k(t_1) \subseteq R$ holds, replacing references to $R_k(t_1)$ by $R$ results in a
stronger consequent. Second, because $t \le t_1$, we conclude that
$next_j^{r_k}(t) \le next_j^{r_k}(t_1)$, and so replacing $t_1$ in
$hold_i^{r_i}(t) + t \le next_j^{r_k}(t_1)$ by $t$ results in a stronger
consequent. We, therefore, obtain:

    $I_2$: $(\forall \tau_i, \tau_j: \tau_j \succ \tau_i \supset$
      $(\forall r_i, r_k: r_i \in R_i(t) \wedge r_k \in R: hold_i^{r_i}(t) + t \le next_j^{r_k}(t)))$

To ensure that $I_2$ holds throughout execution, we show that it is true
initially and that execution does not invalidate it. Assuming that tasks are
initially started with no resources allocated to them, $R_i(0)$ will be empty for
all tasks $\tau_i$, and so $I_2$ is initially true. To ensure that $I_2$ is not
invalidated by execution, assume that it is true and some task $\tau_i$ requests a
resource $r'$. We desire a condition $C(r', \tau_i)$ such that

    $C(r', \tau_i) \wedge I_2 \supset ok(r', I_2)$

is valid. Again, using the definition of $wp$ for assignment statement
$R_i(t) := R_i(t) \cup \{r'\}$, we can verify that any choice for $C(r', \tau_i)$
must imply:

    $(\forall \tau_j, r_k: r_k \in R: \tau_j \succ \tau_i \supset hold_i^{r'}(t) + t \le next_j^{r_k}(t))$

Rearranging terms in the definition for $C(r', \tau_i)$ results in the following
(equivalent) policy:

Policy 2 Let $R$ be the set of all shared resources. The request of task $\tau_i$
for resource $r'$ is granted if

    $hold_i^{r'}(t) \le \min_{\tau_j \in H_i} ( \min_{r_k \in R} (next_j^{r_k}(t) - t) )$.

Otherwise, the request is delayed.

Thus, a task is granted a resource $r'$ only if it will release $r'$ before any
higher-priority task requests any resource.
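
Policy 2 differs from Policy 1 in which tasks and which resources are
consulted. The Python sketch below is illustrative only: the function name and
the nested-dictionary layout are hypothetical, and the comparison is assumed
to be non-strict.

```python
import math


def policy2_grant(t, hold_bound, higher_next_request):
    """Decide whether to grant a request at time t under a Policy 2 style rule.

    hold_bound:          upper bound on how long the requester holds the
                         requested resource
    higher_next_request: higher-priority task -> {shared resource -> lower
                         bound on the time that task next requests it}
    """
    earliest = min(
        (nxt
         for per_resource in higher_next_request.values()
         for nxt in per_resource.values()),
        default=math.inf,  # no higher-priority tasks: nothing can be delayed
    )
    return hold_bound <= earliest - t


higher = {
    "tau_hi": {"disk": 30, "net": 18},
    "tau_mid": {"disk": 25},
}
```

At `t = 10`, a hold bound of 8 is accepted (the earliest higher-priority
request for any resource is at time 18), while a hold bound of 9 is rejected.
Incomparable tasks play no part in the decision, unlike in Policy 1.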

One  way  Policies  1  and  2  can  be  compared  is  by  considering  the  composite 
systems  graphs  one  policy  allows  and  the  other  policy  avoids.  In  a  system 
with  a  total  order  priority  assignment,  Policy  1  is  superior,  while  in  a  system 
where  there  are  many  incomparable  tasks  and  few  shared  resources,  Policy  2 
is  appropriate.  Depending  on  the  application,  there  may  be  other  properties 
of  the  possible  composite  systems  graphs  that  can  be  exploited  in  order  to 
derive  a  better  resource  allocation  policy. 


5 Applying the Theory: Developing a Concurrency Control Protocol

We now consider a strategy in which the $\pi$-cycle condition of Theorem 1 can
be prevented in a database system by designing a timestamp-based
concurrency controller that prevents priority inversions. Other database
concurrency controllers that avoid priority inversions use locking [1, 10].
Thus, by applying our theory, we have been able to derive the first
timestamp-based concurrency controller for avoiding priority inversions.

5.1  Serializability  and  Priority  Inversion 

A task accesses a database by encapsulating its reads and writes on the
database within a transaction. A concurrency controller schedules these
transactions so that their execution is serializable; that is, each transaction
either commits or aborts, and execution of the committed transactions is
equivalent to executing them in some serial order. A transaction $\tau_i$ precedes
a transaction $\tau_j$ in a history $h$, written $\tau_i \ll \tau_j$, if in all serial
executions equivalent to $h$, $\tau_i$ executes before $\tau_j$. Serializability of a
set of transactions $T$ can be characterized in terms of a serialization graph,
$G = (T, E_C)$ where $E_C = \{(u, v) \mid u \ll v\}$. By definition, $\ll$ is an
irreflexive partial order, so $G$ is acyclic.

By allowing only serializable executions, a concurrency controller ensures
that the serialization graph $G$ never contains cycles [2]. For example,
suppose some operation $\alpha_i$ of $\tau_i$ was executed before a
conflicting operation $\beta_j$ of $\tau_j$, where two operations conflict
if their execution order cannot be interchanged without altering the
effects of one or the other. In this case, a concurrency controller
effectively adds an edge from $\tau_i$ to $\tau_j$ in $G$. If two more
conflicting operations, $\gamma_i$ of $\tau_i$ and $\delta_j$ of $\tau_j$,
are to be executed, then $\gamma_i$ must execute before $\delta_j$, since
to do otherwise would add an edge in $G$ from $\tau_j$ to $\tau_i$,
creating a cycle.
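The edge-adding argument above can be illustrated with a small Python sketch. It is only a minimal model under stated assumptions: the class name `SerializationGraph`, its methods, and the string transaction names are hypothetical, and a real concurrency controller would delay or abort rather than simply refuse an edge.

```python
from collections import defaultdict


class SerializationGraph:
    """Directed graph; an edge (a, b) records that a is serialized before b."""

    def __init__(self):
        self.succ = defaultdict(set)

    def reaches(self, src, dst):
        """Return True if dst is reachable from src along existing edges."""
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.succ[node])
        return False

    def try_add_edge(self, first, second):
        """Add first -> second unless that edge would close a cycle."""
        if self.reaches(second, first):
            return False  # second already precedes first
        self.succ[first].add(second)
        return True


g = SerializationGraph()
# alpha_i executed before the conflicting beta_j: edge t_i -> t_j.
alpha_then_beta = g.try_add_edge("t_i", "t_j")
# Executing delta_j before gamma_i would need t_j -> t_i: refused.
delta_then_gamma = g.try_add_edge("t_j", "t_i")
```

Here `alpha_then_beta` is accepted and `delta_then_gamma` is refused, which corresponds to the requirement that $\gamma_i$ execute before $\delta_j$.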

Delays 

One way to ensure that a serialization graph never contains cycles is by
delaying execution of transactions. Henceforth, we will assume that if
$\tau_i \rightarrow \tau_j$ due to a delay introduced by a concurrency
controller, then $\tau_j \ll \tau_i$. This is a reasonable assumption
because, were it not true, there would exist an equivalent serial
execution in which $\tau_i$ precedes $\tau_j$, yet the concurrency
controller delays some operation $\alpha_i$ of $\tau_i$ until operation
$\beta_j$ of $\tau_j$ completes. Such a delay would be capricious, since
all the operations of $\tau_i$ could execute before any operation of
$\tau_j$ and yield the same result as when $\alpha_i$ is delayed. It is,
therefore, not surprising that all the concurrency control algorithms that
we know of satisfy this assumption.

By delaying transactions, a concurrency controller can cause priority
inversions. Define a priority-ordered concurrency controller to be one that
ensures

POCC1: For all transactions $\tau_i$ and $\tau_j$, if $\tau_j$ starts
before $\tau_i$ completes and $\tau_j \succ \tau_i$, then
$\tau_j \ll \tau_i$.

POCC1 ensures that higher-priority transactions are ordered before
concurrently executed lower-priority ones.

We now show that a priority-ordered concurrency controller cannot introduce
priority inversions. By assumption, delay edges that are in a composite
system graph and can be attributed to the concurrency controller have
corresponding conflict edges in the serialization graph. By POCC1, for
every priority edge in the composite system graph there is a corresponding
conflict edge in the serialization graph. Thus, there is a $\pi$-cycle in
the composite system graph only if there is a cycle in the serialization
graph. Since a concurrency controller ensures the absence of cycles in the
serialization graph, there can be no $\pi$-cycle in the composite system
graph.

Aborts 

In addition to delaying transactions, some concurrency controllers avoid
cycles in the serialization graph by aborting transactions. Aborting a
transaction can cause a priority inversion, for example, when a
lower-priority transaction causes a higher-priority transaction to abort.
In terms of our model, aborting a transaction can be regarded as having it
wait for some (virtual) resource held by other transactions. This is
sensible because a transaction $\tau_a$ that is aborted by the concurrency
controller and later rescheduled does no useful work prior to its restart,
and, in our model, a transaction that can do no useful work is considered
blocked.

Whether or not aborting $\tau_a$ will cause a priority inversion depends on
the priorities of transactions holding the virtual resource that is being
requested by (and is blocking) $\tau_a$. The set $B_a$ of transactions
holding this virtual resource consists of those that, had they not executed
at all, would not have led to $\tau_a$ being aborted. Thus, if each
transaction in $B_a$ is in the peer group of $\tau_a$, then aborting
$\tau_a$ to avoid a cycle in the serialization graph does not create a
priority inversion. Let $C$ be the set of transactions involved in cycles
in the serialization graph, and let $A$ be the subset of $C$ that are
aborted in order to remove those cycles. Then, $B_a \subseteq C - A$ holds,
and we have:

POCC2: A concurrency controller that aborts a transaction $\tau_a$ will
avoid priority inversions provided $B_a$ is a subset of the peer group of
$\tau_a$.
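As a rough illustration, POCC2 can be phrased as a subset test. The sketch below assumes that the peer group of a task consists of every other task that it does not outrank (so incomparable and higher-priority tasks are peers); that reading, the function names, and the encoding of the priority assignment as a transitively closed set of `(higher, lower)` pairs are assumptions made for this example.

```python
def peer_group(task, tasks, higher):
    """Other tasks that `task` does not outrank (assumed peer-group reading)."""
    return {t for t in tasks if t != task and (task, t) not in higher}


def abort_avoids_inversion(aborted, blockers, tasks, higher):
    """POCC2 test: every transaction in B_a must be a peer of the aborted one."""
    return set(blockers) <= peer_group(aborted, tasks, higher)


tasks = {"t1", "t2", "t3"}
higher = {("t1", "t2")}  # t1 has priority over t2; t3 is incomparable to both

# Aborting t2 on behalf of t1 and t3 is acceptable under POCC2.
safe_t2 = abort_avoids_inversion("t2", {"t1", "t3"}, tasks, higher)
# Aborting t1 on behalf of the lower-priority t2 is not.
safe_t1 = abort_avoids_inversion("t1", {"t2"}, tasks, higher)
```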

5.2  Timestamp-based  Concurrency  Controllers 

Timestamp-based concurrency controllers work by assigning a unique
timestamp $ts(\tau_i)$ to each transaction $\tau_i$. These timestamps are
used to totally order transactions. The ordering is such that if
$ts(\tau_i) < ts(\tau_j)$, then there is an edge in the serialization graph
from $\tau_i$ to $\tau_j$.

Timestamp-based concurrency controllers both delay transactions and abort
transactions. In order to ensure that such a concurrency controller does
not introduce priority inversions, two conditions suffice. The first
condition is that the concurrency controller be priority-ordered, since
being priority-ordered implies that delays will not introduce priority
inversions. The second is that aborts do not introduce priority inversions.
We now consider each of these conditions in detail.

A timestamp-based concurrency controller can be made priority-ordered by
suitable assignment of timestamps. This is because POCC1 requires that
certain edges exist in a serialization graph. Since the concurrency
controller adds such edges by assigning timestamps, it suffices to ensure
that

POTS: If $\tau_j$ starts before $\tau_i$ completes and
$\tau_j \succ \tau_i$, then $ts(\tau_j) < ts(\tau_i)$.

It is not hard to assign such timestamps. For simplicity, assume integer
priorities; extension to general priorities is straightforward. Also assume
there exist $P$ priority levels $1, 2, \ldots, P$, where level
$\ell_i \succ \ell_j$ if and only if $\ell_i > \ell_j$, and assume that
timestamps are real numbers. Let

    $\mathit{max}^c$ be the largest timestamp of all committed
    transactions,

    $\mathit{min}^a_i$ be the smallest timestamp of all active transactions
    of priority $i$, and

    $\mathit{max}^a_i$ be the largest timestamp of all active transactions
    of priority $i$.

If no transactions of priority $i$ are active, then the value of
$\mathit{min}^a_i$ and $\mathit{max}^a_i$ is defined to be $\bot$, where
$\min(x, \bot) = \max(x, \bot) = x$. Initially, $\mathit{max}^c = 0.0$ and
for all priority levels $\ell$,
$\mathit{min}^a_\ell = \mathit{max}^a_\ell = \bot$. A timestamp $s$ for a
new transaction with priority $p$ can be computed by finding upper and
lower bounds for its value and selecting a unique value in that interval.
To be able to commit, the value of $s$ must be larger than
$\mathit{max}^c$; to satisfy POTS, it must be larger than the timestamps
assigned to all transactions with higher priority and smaller than the
timestamps assigned to all transactions with lower priority. This is
implemented by the code in Figure 3.
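As a rough illustration of this procedure, the following Python sketch computes a timestamp from the two bounds. It follows the stated convention that a larger level number means higher priority, so the lower bound is taken over levels $p$ and above and the upper bound over levels below $p$; the direction of those ranges, the inclusion of level $p$ itself in the lower bound (to keep timestamps at one level distinct), the class name `TimestampAllocator`, and the use of `None` for $\bot$ are all assumptions of this sketch. Bookkeeping on commit or abort is not modeled.

```python
class TimestampAllocator:
    """Sketch of a priority-ordered timestamp allocator (None stands for bottom)."""

    def __init__(self, levels):
        self.levels = levels                      # priority levels 1..levels
        self.max_committed = 0.0                  # max^c
        self.min_active = [None] * (levels + 1)   # min^a, indexed by level
        self.max_active = [None] * (levels + 1)   # max^a, indexed by level

    def allocate(self, p):
        """Return a timestamp for a new transaction at priority level p."""
        # Lower bound: committed timestamps and active ones at levels >= p.
        lows = [self.max_committed]
        lows += [v for v in self.max_active[p:] if v is not None]
        # Upper bound: active timestamps at lower-priority levels (< p).
        highs = [v for v in self.min_active[1:p] if v is not None]
        low = max(lows)
        s = low + 1 if not highs else (low + min(highs)) / 2
        # Record s among the active timestamps of level p.
        old_min, old_max = self.min_active[p], self.max_active[p]
        self.min_active[p] = s if old_min is None else min(old_min, s)
        self.max_active[p] = s if old_max is None else max(old_max, s)
        return s


alloc = TimestampAllocator(levels=3)
ts_low = alloc.allocate(1)    # lowest priority, no other active transactions
ts_high = alloc.allocate(3)   # highest priority: must get a smaller timestamp
ts_mid = alloc.allocate(2)    # must fall between the other two
```

With these three allocations the timestamps satisfy `ts_high < ts_mid < ts_low`, which is the ordering POTS asks for among concurrently active transactions.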

    $\mathit{low} := \max[\mathit{max}^c, \mathit{max}^a_\ell : 1 \le \ell \le p]$;
    $\mathit{high} := \min[\mathit{min}^a_\ell : p < \ell \le P]$;
    if ($\mathit{high} = \bot$) then $s := \mathit{low} + 1$
    else $s := (\mathit{low} + \mathit{high})/2$;
    $\mathit{min}^a_p := \min[\mathit{min}^a_p, s]$;
    $\mathit{max}^a_p := \max[\mathit{max}^a_p, s]$;

    Figure 3: Timestamp Allocator

Having ensured that the concurrency controller is priority-ordered, it only
remains to ensure that aborts do not introduce priority inversions. Suppose
operation $\alpha_i$ from transaction $\tau_i$ is submitted before a
conflicting operation $\beta_j$ from $\tau_j$. If
$ts(\tau_i) > ts(\tau_j)$, then executing these operations in the order
they are submitted would create a cycle in the serialization graph
(footnote 6). Thus, the concurrency controller must abort either $\tau_i$
or $\tau_j$ to avoid this cycle. From POTS, if $\tau_j \succ \tau_i$ then
$ts(\tau_j) < ts(\tau_i)$ holds, so, according to POCC2, no priority
inversion can result by aborting $\tau_i$. The following rule, therefore,
aborts transactions without introducing priority inversions.

Priority Abort Rule: If $\alpha_i$ from transaction $\tau_i$ is submitted
before conflicting operation $\beta_j$ from $\tau_j$ and
$ts(\tau_i) > ts(\tau_j)$, then $\tau_i$ is aborted to avoid a cycle in the
serialization graph.

Notice that this rule is the opposite of what is traditionally used in
timestamp-based concurrency controllers [2]. Traditionally, the transaction
that submitted the last operation (e.g., $\tau_j$ above) would be aborted
because this eliminates all cycles introduced by that operation. For
example, suppose that $ts(\tau_i) = i$ and the following sequence of
operations is submitted, where $r_i[x]$ denotes an operation by transaction
$\tau_i$ to read $x$ and $w_i[x]$ denotes an operation by transaction
$\tau_i$ to write $x$:

    $r_2[x] \; r_3[x] \; w_1[x]$

Performing $w_1[x]$ would introduce two cycles in the serialization graph:
one involving $\tau_1$ and $\tau_2$, the other involving $\tau_1$ and
$\tau_3$. The traditional abort rule would abort $\tau_1$, but would create
a priority inversion if $\tau_1 \succ \tau_2$ or $\tau_1 \succ \tau_3$.
The Priority Abort Rule would abort both $\tau_2$ and $\tau_3$ and cannot
cause a priority inversion because of the way timestamps are assigned.
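The difference between the two abort rules on this example can be traced with a short Python sketch. It is a simplified model, not the implementation referred to in [7]: operations are `(kind, transaction, item)` tuples, the helper names `conflicts` and `replay` are hypothetical, and the write-write refinement of footnote 6 is not modeled.

```python
def conflicts(a, b):
    """Operations conflict if they touch the same item, belong to different
    transactions, and at least one of them is a write."""
    return a[2] == b[2] and a[1] != b[1] and "w" in (a[0], b[0])


def replay(ops, ts, rule):
    """Submit ops in order and return the set of aborted transactions.

    rule="traditional" aborts the submitter of the later operation;
    rule="priority" aborts the earlier, larger-timestamp transactions.
    """
    done, victims = [], set()
    for op in ops:
        if op[1] in victims:
            continue
        late = [e for e in done
                if e[1] not in victims and conflicts(e, op)
                and ts[e[1]] > ts[op[1]]]
        if late and rule == "traditional":
            victims.add(op[1])
            continue
        victims.update(e[1] for e in late)
        done.append(op)
    return victims


ops = [("r", 2, "x"), ("r", 3, "x"), ("w", 1, "x")]
ts = {1: 1, 2: 2, 3: 3}   # ts(t_i) = i, as in the example
traditional = replay(ops, ts, "traditional")
priority = replay(ops, ts, "priority")
```

In this model `traditional` contains only transaction 1, while `priority` contains transactions 2 and 3, matching the outcome described in the text.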

Footnote 6: Actually, if both operations are writes, the second write can
be ignored and no conflict occurs [11].

Implementing the Priority Abort Rule is not completely straightforward.
See [7] for a detailed explanation of such an implementation.


6  Systems  with  Multiple  Schedulers 

It is not uncommon for system resources to be partitioned into disjoint
subsets, where each subset is controlled by an independent scheduler. For
example, it is typical for the processors of a system to be scheduled by a
CPU scheduler, disk drives to be scheduled by some device driver module,
and database accesses to be scheduled by the concurrency control module of
a database manager. A database transaction would interact with all three
separate schedulers during its execution.

Although using multiple, independent schedulers simplifies construction of
a system, it complicates the avoidance of priority inversions. This is
because a scheduler's decision to delay granting some resource to a task
must be made using only partial information about the system state, and a
delay caused by one scheduler might cause a priority inversion with tasks
being delayed by other schedulers. To see this, consider a system with
schedulers $S_1$ and $S_2$ and tasks $\tau_i$, $\tau_j$, and $\tau_k$,
where $\tau_i \succ \tau_k$. Further, suppose that an allocation decision
by $S_1$ leads to $\tau_i \rightarrow \tau_j$ and an allocation decision by
$S_2$ leads to $\tau_j \rightarrow \tau_k$. This is a priority inversion
since $(\tau_i \rightarrow^{+} \tau_k) \wedge (\tau_i \succ \tau_k)$; that
is, $\tau_i$ waits transitively for the lower-priority task $\tau_k$.
Notice that neither $S_1$ nor $S_2$ maintains sufficient local information
to detect or prevent this priority inversion.
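The two-scheduler example can be checked mechanically. The Python sketch below takes a priority inversion to mean that some task waits, directly or transitively, for a task of lower priority, as in the example; the function names and the encoding of waits-for edges as `(waiter, holder)` pairs and of the priority assignment as `(higher, lower)` pairs are assumptions made for illustration.

```python
def waits_transitively(waits, src, dst):
    """True if src reaches dst through one or more waits-for edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        for nxt in (v for (u, v) in waits if u == node):
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def has_inversion(waits, higher):
    """True if some task waits, directly or transitively, for a lower-priority one."""
    return any(waits_transitively(waits, hi, lo) for (hi, lo) in higher)


higher = {("t_i", "t_k")}     # t_i has priority over t_k
s1_waits = {("t_i", "t_j")}   # delay introduced by scheduler S1
s2_waits = {("t_j", "t_k")}   # delay introduced by scheduler S2

local_s1 = has_inversion(s1_waits, higher)
local_s2 = has_inversion(s2_waits, higher)
combined = has_inversion(s1_waits | s2_waits, higher)
```

Each scheduler's own edges show no inversion (`local_s1` and `local_s2` are false), yet the union of the two edge sets does (`combined` is true).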

We can enjoy the benefits of separate schedulers if, by analyzing each
scheduler in isolation, freedom from priority inversions can be ensured. An
obvious local criterion for correctness of a scheduler $S$ is that $S$
prevent priority inversions among tasks that have requested but not
released resources from $S$. The two-scheduler example of the previous
paragraph illustrates that this criterion by itself is not sufficient to
ensure freedom from priority inversion: both $S_1$ and $S_2$ avoided such
local priority inversions. We, therefore, now investigate useful conditions
to ensure that avoiding local priority inversions is sufficient for
avoiding all priority inversions.

Consider a system in which there is a set of independent schedulers and a
single priority assignment that is known to all (footnote 7). Define
$\tau_i \rightarrow_S \tau_j$ to hold if and only if
$\tau_i \rightarrow \tau_j$ and scheduler $S$ is delaying $\tau_i$'s
request for some resource.

Footnote 7: The case where schedulers do not share a common priority
assignment is discussed in Section 7.2.

The local state of a scheduler $S$ can be characterized by a directed graph
$G_S = (T, E_S)$ where
$E_S = \{(u, v) \mid v \succ u \vee u \rightarrow_S v\}$. Thus, $G_S$
includes all of the priority edges of the composite system graph but only a
subset of the waits-for edges. A local priority inversion exists for
scheduler $S$ if and only if $G_S$ contains a $\pi$-cycle.

Since the composite system graph for a system with multiple schedulers is
given by $G = (T, \bigcup_S E_S)$, a system contains a priority inversion
at time $t$ if and only if $\bigcup_S G_S$ contains a $\pi$-cycle. However,
as illustrated in the two-scheduler example above, existence of a
$\pi$-cycle in $\bigcup_S G_S$ does not imply a $\pi$-cycle in $G_S$ for
some scheduler $S$. The following theorem shows that this discrepancy is
linked to specifying priority assignments with partial orders.

Theorem 2 A system with multiple schedulers is free from priority inversion
if the priority assignment is a total order and each scheduler avoids local
priority inversions.

Proof: By contradiction. Assume that the system has a priority inversion
characterized by the $\pi$-cycle

    $\tau_i \Rightarrow \tau_j \rightarrow \cdots \rightarrow \tau_k \rightarrow \tau_\ell \rightarrow \cdots \rightarrow \tau_i$

in the composite system graph. For each $\tau_k \rightarrow \tau_\ell$ in
the $\pi$-cycle, we conclude $\tau_\ell \succ \tau_k$ because every pair of
tasks is related by the priority assignment and no scheduler allows a local
priority inversion. By the transitivity of $\succ$, we have
$\tau_i \succ \tau_j$. But, from $\tau_i \Rightarrow \tau_j$ we conclude
$\tau_j \succ \tau_i$ and obtain the contradiction that
$\tau_i \succ \tau_j$ and $\tau_j \succ \tau_i$. □

Even in systems where the priority assignment is a partial order, it is
possible to design schedulers that use only local information yet still
manage to avoid all priority inversions. The strategy is for schedulers to
be conservative and never permit a local configuration necessary for a
$\pi$-cycle to develop. An example of such a strategy is given by the
following theorem.

Theorem 3 If a task $\tau_i$ is never allowed to wait when there exists
$\tau_j$ such that $\tau_i \succ \tau_j$, then the system is guaranteed to
be free from all priority inversions.

Proof: The stated allocation policy prevents executions that lead to
$\tau_j \Rightarrow \tau_i \rightarrow \tau$ for any $\tau$. Since this is
a necessary configuration for a $\pi$-cycle, the system cannot contain
priority inversions. □

A symmetric policy, which prevents
$\tau \rightarrow \tau_j \Rightarrow \tau_i$ configurations, also works.
However, both policies may degenerate to "do not allocate any resource to
any task," a policy that is not very useful. The simple system with three
tasks $\tau_i$, $\tau_j$, $\tau_k$ and a priority assignment where
$\tau_j \succ \tau_i$ and $\tau_k \succ \tau_i$ illustrates the problem
with the first policy. Suppose all three tasks share a single processor.
Task $\tau_i$ cannot be scheduled since this allocation decision would lead
to $(\tau_j \rightarrow \tau_i) \wedge (\tau_j \succ \tau_i)$, a priority
inversion. Task $\tau_j$ cannot be scheduled since it would lead to
$\tau_k \rightarrow \tau_j$, which violates the scheduling policy that a
task ($\tau_k$) is never allowed to wait when there exists a task
($\tau_i$) with lower priority. Finally, task $\tau_k$ cannot be scheduled
by the symmetric argument. In other words, none of the three tasks can run
even though two have incomparable priorities!
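The degenerate case can be reproduced with a few lines of Python. The sketch assumes a single processor for which every task is ready, so that whichever task runs forces all the others to wait; the function name and the `(higher, lower)` encoding of the priority assignment are hypothetical.

```python
def runnable_under_theorem3(tasks, higher):
    """Tasks that may hold the single processor when no task that outranks
    some other task is ever allowed to wait."""
    outranks_someone = {hi for (hi, _lo) in higher}
    return {run for run in tasks
            if not any(w in outranks_someone for w in tasks if w != run)}


tasks = {"t_i", "t_j", "t_k"}
higher = {("t_j", "t_i"), ("t_k", "t_i")}   # t_j and t_k both outrank t_i
allowed = runnable_under_theorem3(tasks, higher)
```

For this priority assignment `allowed` is empty: every choice of running task leaves either $\tau_j$ or $\tau_k$ waiting.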

In many systems, the priority assignment is encoded by associating an
integer priority $\Pi(\tau)$ with each task $\tau$. If each task is
assigned a unique priority, then the result is a total order and Theorem 2
implies that avoiding local priority inversions is sufficient for avoiding
all priority inversions. However, constructing a total order from a partial
order can require introduction of fictitious priority relations, and
avoiding priority inversions that involve these fictitious relations is
unnecessary. Thus, we now consider the case where a unique priority is not
assigned to each task, but $\Pi$ does satisfy the following
less-restrictive conditions, which define a partial order (as opposed to an
irreflexive partial order).

P1. $\Pi(\tau) > \Pi(\tau')$ if and only if $\tau \succ \tau'$.

P2. $\Pi(\tau) = \Pi(\tau')$ implies
$\neg(\tau \succ \tau') \wedge \neg(\tau' \succ \tau)$.

Observe that for a given priority assignment, a mapping that satisfies P1
and P2 might require adding some fictitious priority relations, but would
require adding fewer priority relations than if $\Pi$ defined a total
order. The following theorem asserts that even though $\Pi$ defines a
partial order, due to P1 and P2, avoiding local priority inversions
suffices to avoid all priority inversions.

Theorem 4 If a task $\tau$ is only allowed to wait for a task $\tau'$ for
which $\Pi(\tau) \le \Pi(\tau')$, then the system is guaranteed to be free
from all priority inversions.


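Proof: By contradiction. Assume a priority inversion characterized by the
$\pi$-cycle

    $\tau_i \Rightarrow \tau_j \rightarrow \cdots \rightarrow \tau_k \rightarrow \tau_\ell \rightarrow \cdots \rightarrow \tau_i$

in the composite system graph. From the allocation policy,
$\tau_k \rightarrow \tau_\ell$ implies $\Pi(\tau_\ell) \ge \Pi(\tau_k)$ for
any two tasks $\tau_k$ and $\tau_\ell$ in the $\pi$-cycle. By transitivity,
we have $\Pi(\tau_i) \ge \Pi(\tau_j)$. By the hypothesis,
$\tau_j \succ \tau_i$ and thus by P1 we have $\Pi(\tau_j) > \Pi(\tau_i)$, a
contradiction. □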

Note that the problem illustrated above for the policy of Theorem 3 no
longer exists. One possible mapping that corresponds to the priority
assignment of the example is $\Pi(\tau_i) = 1$,
$\Pi(\tau_j) = \Pi(\tau_k) = 2$. Since tasks $\tau_j$ and $\tau_k$ have
equal priority levels, either one could be scheduled without risking a
priority inversion in the global system state.
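The same three-task system can be replayed under the policy of Theorem 4. The Python sketch below again assumes a single processor with all tasks ready; `may_wait` and `runnable_under_theorem4` are hypothetical names, and `pi` is the integer mapping from the example.

```python
def may_wait(pi, waiter, holder):
    """Theorem 4 policy: waiter may wait for holder only if
    pi[waiter] <= pi[holder]."""
    return pi[waiter] <= pi[holder]


def runnable_under_theorem4(tasks, pi):
    """Tasks that may hold a single shared processor while all others wait."""
    return {run for run in tasks
            if all(may_wait(pi, w, run) for w in tasks if w != run)}


pi = {"t_i": 1, "t_j": 2, "t_k": 2}
allowed = runnable_under_theorem4(set(pi), pi)
```

Here `allowed` contains both `t_j` and `t_k` but not `t_i`, so the system can make progress without risking an inversion.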

7  Discussion 

The characterization of priority inversion given above is useful only to
the extent that the formal model on which it is based correctly captures
the relevant aspects of reality. We, therefore, now discuss the suitability
of our model and the relaxation of certain of its restrictions.

7.1  Priority  Assignments  as  Partial  Orders 

We have elected to formalize priority assignments using irreflexive partial
orders rather than mappings from tasks to integers (as is done in many
operating systems). This selection was made because irreflexive partial
orders are more expressive. For example, there is no mapping of tasks to
integers to state that a task $\tau_1$ competes on an equal footing with
both $\tau_2$ and $\tau_3$ but $\tau_3$ has priority over $\tau_2$. Such a
mapping $\Psi$ would have to satisfy $\Psi(\tau_1) = \Psi(\tau_2)$,
$\Psi(\tau_1) = \Psi(\tau_3)$, and $\Psi(\tau_3) > \Psi(\tau_2)$, which
together give the contradiction
$\Psi(\tau_2) = \Psi(\tau_3) > \Psi(\tau_2)$. Also, using an irreflexive
partial order avoids introducing fictitious priority relations, which, in
turn, avoids fictitious priority inversions.

When irreflexive partial orders are used to specify priority assignments,
there are two possible interpretations for the case where two tasks have
incomparable priorities. One is that these tasks do not compete for
resources; the other is that these tasks compete for resources but on an
equal footing. In either case, if two tasks are incomparable then, by
definition (see Section 2.1), each will be in the peer group of the other.
At first, including in the peer group for $\tau$ tasks that do not compete
for resources with $\tau$ might seem troubling. However, if two tasks do
not compete for resources, then it doesn't matter whether one is in the
peer group of the other: any analysis based on that peer group will be no
more complicated by the presence of the non-competing task.

Using irreflexive partial orders to specify priority assignments does have
drawbacks. As shown in Section 6, using individual schedulers that ensure
freedom from local priority inversion does not by itself guarantee that a
system will be free of priority inversions. However, Theorem 4 shows that
if the priority assignment is restricted to one that could be represented
by a mapping from tasks to integers, guaranteeing freedom from local
priority inversions does guarantee freedom from all priority inversions.
Thus, in systems with multiple, independent schedulers, there are
advantages to employing the less-expressive formulation of priority
assignment.

7.2  Static  and  Global  Priority  Assignment 

One limitation of our model is the assumption of a single, static priority
assignment. This rules out systems where a task's priority is a function of
the system state. It also rules out systems with multiple independent
schedulers that each assign different priorities to tasks. A time-varying
or dynamic priority structure can be modeled as a sequence of priority
assignments,

    $\succ_1, \succ_2, \ldots, \succ_t, \ldots$

where $\succ_t$ is the priority assignment in effect at time $t$. The
formal characterization of priority inversion in Section 3 remains valid
with this extension, but the protocols of Sections 4 and 5 require
modification. This is because if the priority assignment is not static,
priority inversions can be caused simply by changing the priority
assignment; with a static priority assignment, a priority inversion can
only occur from an (ungranted) request operation. Avoiding priority
inversions for a dynamic priority structure, therefore, requires that
changes to the priority assignment be coupled with the waits-for relation.

Our definition of priority inversion does not work for the case where
different schedulers can assign different priorities to tasks. To
understand the problem, consider a system with two resources (a processor
and a communications channel) and two tasks $\tau_1$ and $\tau_2$. Suppose
the priority assignment $\succ_p$ used by the processor has
$\tau_1 \succ_p \tau_2$ and the priority assignment $\succ_c$ used by the
channel has $\tau_2 \succ_c \tau_1$. It is possible for
$\tau_1 \rightarrow \tau_2$ due to an allocation decision by the
communications channel and for $\tau_2 \rightarrow \tau_1$ due to an
allocation decision by the processor. We believe that this scenario should
not be considered a priority inversion because no task is being prevented
from using a resource by a task that has a lower priority for the use of
that resource. However,
$(\tau_1 \succ \tau_2) \wedge (\tau_1 \rightarrow \tau_2)$ holds, which
means there is a priority inversion according to the definition of
Section 3.

7.3  Multiple-unit  Resource  Requests 

Another limitation of our model is the requirement that tasks request
individual resources sequentially. As a result of this limitation, we have
not had to define what constitutes a priority inversion when a task can
request multiple resources simultaneously. There is good reason for this
omission: it is not clear what the correct definition should be. Consider a
system consisting of two processors $P_1$ and $P_2$ and three tasks
$\tau_1$, $\tau_2$ and $\tau_3$. Further, suppose $\tau_1 \succ \tau_2$,
$\tau_1$ requires two processors, and $\tau_2$ and $\tau_3$ each require a
single processor. It seems reasonable to claim that $\tau_2$ executing on
$P_1$ while $P_2$ is idle constitutes a priority inversion, since $\tau_2$
holds a resource required by higher-priority task $\tau_1$. What is not
clear is whether $\tau_3$ executing on $P_1$ while $P_2$ is idle should
also be considered a priority inversion. On the one hand, no task holds
resources required by a higher-priority task, suggesting that this should
not be considered a priority inversion. On the other hand, if this is not
considered a priority inversion then the seemingly harmless act of putting
idle processor $P_2$ to work executing $\tau_2$ should not be considered a
priority inversion, either. Yet, $\tau_2$ now holds a resource required by
$\tau_1$, implying that a priority inversion exists.

7.4  Bounded  Priority  Inversions

This paper has been concerned with avoiding priority inversions completely.
However, it is sometimes acceptable for priority inversions to occur,
provided they are bounded in duration. Suppose a task is being delayed by
some lower-priority task. If the duration of this delay has a known bound,
then we could view the situation as if the delay were included in the fixed
overhead associated with allocating the resource. So, by adding to the cost
of a request operation any delay due to a priority inversion, an analysis
using peer groups would remain valid (despite the priority inversion). The
priority inheritance protocols in [10], for example, provide a way to bound
priority inversions under certain circumstances; the protocols of [8]
prevent priority inversions of unbounded duration.

Define a $D$-bounded priority inversion to be a priority inversion whose
duration is not longer than $D$ and an unbounded priority inversion to be
one whose duration cannot be so bounded. Avoiding all but $D$-bounded
priority inversions can be easier and less costly than avoiding all
priority inversions (footnote 8). For example, detection can be used to
eliminate all but $D$-bounded priority inversions: tasks run their course
and periodically a protocol is executed to detect and eliminate priority
inversions that may have formed (footnote 9). The frequency with which the
detector runs determines the upper bound $D$ on the duration of priority
inversions.

In theory, it is easy to build a detector for priority inversions. The
detector must construct the composite system graph from the information
available to schedulers. In a distributed system with multiple independent
schedulers, a distributed snapshot algorithm [3] would have to be employed
for this purpose. Priority inversion detection is then simply a matter of
checking for $\pi$-cycles in the composite system graph of a snapshot.
Note, however, that a priority inversion might have vanished by the time it
is detected because occurrence of a $\pi$-cycle is not a stable
property [3]. Such "ghost" priority inversions do not cause problems,
however, because they are not unbounded priority inversions.

Footnote 8: Analyzing a scheduling protocol to determine a bound $D$ can be
a hard problem, however [10].

Footnote 9: Elimination of the priority inversion requires either that
resource allocations to tasks be preemptable or that task executions be
abortable.

8  Conclusions

This  paper  gives  a  formal  characterization  of  priority  inversion  and  gives  a 
set  of  sufficient  conditions  for  its  avoidance.  Based  on  these  conditions,  we 
have  been  able  to  derive  new  protocols  to  avoid  priority  inversion.  We  have 
also  been  able  to  give  conditions  to  avoid  priority  inversion  in  systems  with 
multiple  schedulers  that  do  not  communicate. 

The existence of a theory characterizing priority inversions makes it
possible to design both general and application-specific protocols to avoid
priority inversions. The theory also permits the consequences of system
design decisions to be better understood. For example, we were surprised to
find that choosing between an irreflexive partial order and an integer
mapping representation of priority assignment can be significant. We were
also surprised that the definition of priority inversion is elusive for
systems in which multiple resource requests are possible or in which
independent schedulers use different priority assignments.